Improving Imputation Accuracy in Ordinal Data Using Classification
نویسندگان
چکیده
Tackling missing data is one of the fundamental data pre-processing steps. Data analysis and pattern extraction are affected due to the underlying differences between instances with and without missing data. This is a particular problem with ordinal data, where for example a sample of a population may have all failed to answer a specific question in a questionnaire. The existing methods such as listwise deletion, mean attribute substitution, and regression substitution, naively impute data. They do not impute plausible values as they fail to take into account the relationships between the attributes, but instead consider the distribution of the attribute with missing values only. In this paper we introduce the use of Classification Based Imputation (CNI) to replace missing values with plausible values in ordinal data. The results show that not only does the CNI based technique outperform the existing approaches for imputing missing values in ordinal data but it also helps to improve the classification accuracy of machine learning algorithms.
منابع مشابه
ارزیابی صحت پیشبینی ژنومی در معماریهای مختلف ژنومی صفات کمی و آستانهای با جانهی دادههای ژنومی شبیهسازیشده، توسط روش جنگل تصادفی
Genomic selection is a promising challenge for discovering genetic variants influencing quantitative and threshold traits for improving the genetic gain and accuracy of genomic prediction in animal breeding. Since a proportion of genotypes are generally uncalled, therefore, prediction of genomic accuracy requires imputation of missing genotypes. The objectives of this study were (1) to quantify...
متن کاملImputation of parent-offspring trios and their effect on accuracy of genomic prediction using Bayesian method
The objective of this study was to evaluate the imputation accuracy of parent-offspring trios under different scenarios. By using simulated datasets, the performance Bayesian LASSO in genomic prediction was also examined. The genome consisted of 5 chromosomes and each chromosome was set as 1 Morgan length. The number of SNPs per chromosome was 10000. One hundred QTLs were randomly distributed a...
متن کاملEffect of Reference Population Size and Imputation Methods on the Accuracy of Imputation in Pure and Mixed Populations
Imputation as a method of creating low-density chips to high-density chips has been introduced to increase the accuracy of genomic selection in animals. In the current study, to investing imputation accuracy, three populations of mixed (scenario 1), pure (scenario 2) and mixed + pure (scenario 3) were simulated using QMSim. Two methods of imputation including Beagle and Flmpute were used fo...
متن کاملچند رویکرد برخورد با مقادیر گمشده متغیرهای کمی و بررسی اثر آنها بر نتایج حاصل از یک کارآزمایی بالینی
Background and Objectives: A major challenge that affects the longitudinal studies is the problem of missing data. Missing in the data may result in the loss of part of the information which reduces the accuracy of the estimator and obtain the results will be biased and inaccurate. Therefore, it is necessary to evaluate the missing data mechanism from a longitudinal research and to consider thi...
متن کاملSFLA Based Gene Selection Approach for Improving Cancer Classification Accuracy
In this paper, we propose a new gene selection algorithm based on Shuffled Frog Leaping Algorithm that is called SFLA-FS. The proposed algorithm is used for improving cancer classification accuracy. Most of the biological datasets such as cancer datasets have a large number of genes and few samples. However, most of these genes are not usable in some tasks for example in cancer classification....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016